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Abstract: Pattern recognition approaches to the analysis of neuroimaging data have brought new 
applications such as the classification of patients and healthy controls within reach. In our view, the 
reliance on expensive neuroimaging techniques which are not well tolerated by many patient groups 
and the inability of most current biomarker algorithms to accommodate information about prior class 
frequencies (such as a disorder's prevalence in the general population) are key factors limiting practi- 
cal application. To overcome both limitations, we propose a probabilistic pattern recognition approach 
based on cheap and easy-to-use multi-channel near-infrared spectroscopy (fNIRS) measurements. We 
show the validity of our method by applying it to data from healthy controls (n = 14) enabling differ- 
entiation between the conditions of a visual checkerboard task. Second, we show that high-accuracy 
single subject classification of patients with schizophrenia (w = 40) and healthy controls (w = 40) is 
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possible based on temporal patterns of fNIRS data measured during a working memory task. For clas- 
sification, we integrate spatial and temporal information at each channel to estimate overall classifica- 
tion accuracy. This yields an overall accuracy of 76% which is comparable to the highest ever achieved 
in biomarker-based classification of patients with schizophrenia. In summary, the proposed algorithm 
in combination with fNIRS measurements enables the analysis of sub-second, multivariate temporal 
patterns of BOLD responses and high-accuracy predictions based on low-cost, easy-to-use fNIRS pat- 
terns. In addition, our approach can easily compensate for variable class priors, which is highly advan- 
tageous in making predictions in a wide range of clinical neuroimaging applications. Hum Brain Mapp 

34:1102-1114, 2013. © 2012 Wiley Periodicals, Inc. 
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INTRODUCTION 

Recently, the interest in pattern recognition approaches 
to the analysis of clinical neuroimaging data has increased 
substantially. A crucial advantage of multivariate pattern 
recognition algorithms is that they provide predictions on 
the level of individual subjects while making use of infor- 
mation encoded by correlations between voxels (Mar- 
quand et al, 2010; Mourao-Miranda et al., 2005). It is this 
multivariate nature of pattern recognition algorithms that 
results in increased sensitivity over univariate methods 
(Ecker et al, 2009; Marquand et al, 2010) and has led to 
numerous applications in clinical research. Specifically, 
pattern recognition algorithms have shown their potential 
for high-accuracy classification at the level of individual 
subjects involving a number of disorders such as Alzhei- 
mer's disease (Tripoliti et al., 2009), autism (Ecker et al., 
2009), attention deficit hyperactivity disorder (Zhu et al., 
2008), schizophrenia (Kawasaki et al., 2007; for a review 
see Demirci et al., 2008) and major depressive disorder (Fu 
et al., 2008; Hahn et al., 2011; Marquand et al., 2008). 

From this, the question arises why none of these promis- 
ing developments have found their way into practical 
application. In our view, two of the most important factors 
limiting the widespread application of machine learning 
techniques in clinical practice are first, that they commonly 
rely on expensive neuroimaging techniques (mainly MRI) 
not well tolerated by many patients and - though increas- 
ingly available in many hospitals - complexity of bio- 
marker research and application currently still requires 
cooperation with academic research institutions. Second, 
most existing applications do not provide mechanisms to 
accommodate variability in prior class frequencies. In 
other words, they are not well suited to making predic- 
tions in contexts where the frequency of each class differs 
from that observed in the training dataset. One reason this 
is important for clinical studies is to accommodate disease 
prevalence: for most psychiatric disorders, training a clas- 
sifier based on a training set with class frequencies that 
accurately reflect disease prevalence is suboptimal, as it 
can be difficult for the classifier to adequately characterize 
the patient group. Instead, many neuroimaging studies 



train classifiers based on artificially balanced datasets con- 
taining approximately equal numbers of patients and con- 
trols. This approach is suitable for making predictions in 
new datasets with similar class frequencies but it becomes 
problematic if the objective is to make predictions in con- 
texts where the class frequencies are different (Bishop, 
2007). For example, if a classifier is trained to diagnose a 
relatively rare disease using an artificially balanced data- 
set, the classifier will usually result in an unacceptably 
high type I error rate if applied to subjects drawn ran- 
domly from the general population. Overcoming both 
these obstacles are crucial requirements necessary for 
machine learning techniques to become widely used in 
clinical practice. 

To address these issues, we propose a probabilistic spa- 
tiotemporal pattern recognition approach based on high- 
temporal-resolution functional near-infrared spectroscopy 
(fNIRS) data. Our approach enables predictions based 
simultaneously on the spatial pattern and temporal dy- 
namics of the BOLD response and has highly desirable 
characteristics for clinical research studies: fNIRS contains 
information useful for basic and clinical research (Suto 
et al., 2004), it is substantially less expensive than alterna- 
tive imaging modalities (e.g. fMRI), it is well tolerated and 
easily transportable (e.g. to the patient's bedside). Our 
approach is also fully probabilistic, which means predic- 
tions can be easily corrected to compensate for variable 
class priors (e.g. disease prevalence). Ultimately, this 
ensures that inference remains coherent even if the fre- 
quency of each class in the test dataset is substantially dif- 
ferent from the class frequencies observed in the training 
dataset (Bishop, 2007). 

We are not aware of any studies that apply pattern rec- 
ognition methods to fNIRS data, nor of any studies that 
employ the temporal and spatiotemporal classification 
approaches proposed here. Therefore, we first provide a 
proof of concept application to validate our approach, 
where we apply the proposed temporal and spatiotempo- 
ral classifiers to fNIRS data obtained during a visual 
checkerboard task and then assess whether the resulting 
accuracies are higher over the primary visual cortex than 
over adjacent areas. We then apply the algorithm to the 
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diagnostic classification of schizophrenia using spatiotem- 
poral patterns of brain activity during an n-back working 
memory task. Finally, we demonstrate how our approach 
can be used to compensate for variable class priors. We 
hypothesize that combining spatial and temporal informa- 
tion will allow for high-accuracy single-subject classifica- 
tion of schizophrenic patients and controls based on fNIRS 
data and that compensating for class priors will improve 
the predictive value of the classifier in unbalanced 
datasets. 



MATERIALS AND METHODS 

Functional Near-Infrared Spectroscopy 

Near-infrared spectroscopy (NIRS), an optical imaging 
technique (see Obrig and Villringer, 2003), uses near-infra- 
red light (i.e. light with 650-850 nm wavelengths) that read- 
ily penetrates biological tissue (e.g. skin, skull, brain tissue). 
Basically, fNIRS measures changes similar to the blood oxy- 
gen level-dependent (BOLD) effect in functional magnetic 



resonance imaging (fMRI). Compared to fMRI, fNIRS is less 
restrictive, less expensive and less sensitive to motion arte- 
facts and highly accepted by participants as well as easily 
transportable. Therefore it is optimally suited for claustro- 
phobic or elderly participants (Herrmann et al., 2008; Zeller 
et al., 2010), psychiatric patients (Fallgatter et al, 2004; 
Schecklmann et al., 2008b) and children (Dresler et al., 
2009). The method has proven to be highly reliable (Plichta 
et al., 2006a, 2007b; Schecklmann et al., 2008a); however, it 
comes at the cost of only being able to measure cortical 
structures (for an in-depth description of fNIRS, see Hoshi, 
2003; Obrig and Villringer, 2003) (Fig. 1). 

Visual Checkerboard Paradigm 
Participants 

Fifteen subjects were examined (mean age = 25.3 ± 2.6 
years; 8 females). All subjects had normal or corrected to 
normal vision and no history of neurological or psychiatric 
disorders. One subject was excluded from analysis because 
of problems with localizing the visual cortex during the 
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Figure I. 

Schematic arrangement of the fNIRS probe set (red squares: over the occipital cortex, (b) For the n-back task, the inferior 
emitters; blue squares: detectors; numbers: measurement chan- row of the left probe set was oriented towards T3 and Fpl (T4 
nels). For the visual checkerboard paradigm, a functional local- and Fp2 for the right side) according to the international 10-20 
izer was used to position the probeset directly over the primary system (Jasper, 1958). 
visual cortex, (a) Figure illustrates the approximate position 
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functional localizer task. All subjects gave written 
informed consent. The investigation was in accordance 
with the latest version of the Declaration of Helsinki and 
was approved by the Ethics Committee of the University 
of Wuerzburg. Data from this sample have been published 
previously (Plichta et al., 2007a). 

Task description and procedures 

A simple checkerboard paradigm was chosen as it has 
previously shown strong and robust activation effects in 
the primary visual cortex accessible with fNIRS (Hahn 
et al., 2010; Plichta et al., 2007a, 2006b). A parametric 
design with four levels of luminance intensities (Michelsen 
contrasts: 0, 8, 40 and 97%) was conducted, based on find- 
ings from fMRI studies (Avidan et al., 2002; Boynton et al., 
1999; Heeger and Ress, 2002) which have demonstrated 
that an increase of stimulus contrast levels results in a 
monotonic increase of the primary visual cortex (VI) activ- 
ity. In an event-related paradigm, a series of checker- 
boards was presented, each for 2 s, reversing in contrast at 
8 Hz (according to Ozus et al., 2001) followed by an inter- 
stimulus interval (ISI) of a uniform grey colour screen of 
variable duration (ISI = 4-9 s, average: 6.5 s). Further 
details can be found in Plichta et al. (2007a). 

fNIRS data acquisition and extraction 

Measurements were conducted with the ETG-4000 Opti- 
cal Topography System (Hitachi Medical Co., Japan) using 
a 52-channel array of optodes, covering the occipital cortex 
(interoptode distance 30 mm, sampling rate 10 Hz). Ini- 
tially, the probe set was placed with its lowest row centre 
optode at the subjects' inion. 

Pre-processing of the data went as follows: First, the 
raw time-series at each channel was low-pass filtered with 
a cut-off frequency of 0.7 Hz. Then, the time-segment start- 
ing at the beginning of each trial and ending after 18 s 
was extracted. Following this, the average time-course 
over all trials of the same condition was calculated, yield- 
ing one time-segment for each channel and condition. 
Finally, the baseline calculated from the first 3 s preceding 
the onset of the trial was subtracted from each time point 
(these steps are in accordance with Plichta et al., 2007a). 

N-back Task 

Participants 

Forty unrelated chronic schizophrenic patients according 
to ICD-10 criteria (23 male; mean age = 42 ± 12 years; 
mean duration of illness = 16 years ± 11 years) partici- 
pated. All patients were recruited at the Department of 
Psychiatry, Psychosomatics and Psychotherapy, University 
of Wiirzburg, and treated with typical (w = 11) and/or 
atypical antipsychotics (n = 31) as well as antidepressants 
(h = 20) and/or benzodiazepines (n = 11). None of the 



subjects showed significant neurological co-morbidity, 
mental retardation or other somatic disorders. Diagnoses 
were made by an extensive, semi-structured interview 
analogous to the AMDP interview (AMDP, 2000) con- 
ducted by an experienced psychiatrist. Furthermore, 40 
controls (19 male; mean age = 40 ± 16 years) without his- 
tory for any axis-I diagnosis or use of psychotropic medi- 
cation participated. 

Both groups were comparable for age (T = 0.712; df = 
78; p = 0.479) and gender (x 2 = 0.802; df = 1; p = 0.370). 
Subjects gave written consent after detailed explanation of 
the procedures. The study was approved by the Ethics 
Committee of the University of Wiirzburg, and was in ac- 
cordance with the Declaration of Helsinki. Data from this 
sample have been published previously (Reif et al., 2011). 

Task description and procedures 

All subjects performed a letter n-back task. For the 1- 
back condition [low working memory (WM) load], they 
were instructed to press a response button, whenever a 
letter presented on a screen was identical to the preceding 
letter. For the 2-back condition (high WM load), they had 
to respond whenever a letter was identical to the one two 
trials before. The tasks were performed alternately in a 
block-wise fashion and separated by 30-s resting segments 
during which participants were instructed to sit still. Let- 
ters were presented in pseudorandomized order with a 
presentation time of 300 ms and an inter-stimulus interval 
of 1700 ms. Both task conditions were conducted three 
times each, resulting in three 30-s task segments for both 
paradigms. A 10-s baseline period preceded the first task. 
For both tasks, a total of 12 target trials appeared (for fur- 
ther details, see Reif et al., 2011). 

fNIRS data acquisition and extraction 

FNIRS measurements were conducted with the same 
machine (see section fNIRS data acquisition and extraction), 
now using two 22-channel arrays of optodes, covering fron- 
tal areas on the left and right side of the head (12 x 6 cm 
each). The arrays were localized by placing the second to 
last optode of the bottom row onto T3/T4 on the left and 
right side of the head, respectively, and orienting the array 
towards Fpl/Fp2 (according to the International 10/20 sys- 
tem for electrode placement). The fNIRS probe set therefore 
covered lateral parts of the fronto-temporal cortex. Both 
arrays consisted of eight light emitters and seven photo- 
detectors. Slow changes in the fNIRS signal unrelated to 
functional stimulation were removed by a linear fitting pro- 
cedure, using a pre-stimulus baseline (the mean across 10 s 
before the active task period) and a post-stimulus baseline 
(the mean across the last 10 s of the 20-s rest segment) as rec- 
ommended for removing long lasting physiological effects 
unrelated to brain activity due to functional stimulation 
(Ehlis et al., 2005) Then, the time-segment starting at the be- 
ginning of each block and ending after 30 s was extracted. 
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Data were averaged according to the specific task condition 
(2-back, 1-back) channel-wise for each participant, yielding 
one time segment for each channel and condition (these pre- 
processing steps are in accordance with the procedure out- 
lined by Reif et al. (2011)). 

Temporal Pattern Recognition of fNIRS 
Time-Series Data 

Data pre-processing and Gaussian 
process classification 

After the paradigm-specific pre-processing described 
above, the resulting time-course for each condition of each 
paradigm was mean-corrected (i.e. the mean of the time- 
series was subtracted from each data point). From this 
pre-processed time-course, the feature vector (with dimen- 
sionality equal to the number of time-points in the respec- 
tive segment) was constructed. 

We employed Gaussian process classifiers (GPCs) as the 
primary classification approach in this study because they 
are fully probabilistic prediction devices and therefore pro- 
vide predictions that can be easily adjusted to compensate 
for variable class priors (see section Accommodating Vari- 
able Class Priors). In contrast, Support Vector Machines 
(SVMs) do not provide probabilistic predictions although 
ad hoc methods have been proposed to convert SVM pre- 
dictions to probability scores (e.g. Piatt, 1999). However, 
such methods are not ideal as they are not guaranteed to 
accurately approximate the true predictive probability 
density (Tipping, 2001). A second advantage of GPCs is 
that they provide an elegant mechanism for automatic pa- 
rameter optimization. We do not review the details here, 
but in practice this is achieved by finding the model pa- 
rameters that are most likely to have generated the data 
by maximizing the Bayesian model evidence (see Rasmus- 
sen and Williams, 2006). The main advantage of this 
approach is that multiple model parameters can be learned 
efficiently from the data without needing to evaluate 
model performance across a range of parameter settings. 
In contrast SVMs do not provide any methods for auto- 
matic parameter optimization, so all free parameters must 
be either set heuristically or optimized by cumbersome 
and computationally demanding methods, such as nested 
cross-validation, ultimately allowing for only a very lim- 
ited number of parameters to be learned from the data. 
This feature of GPC models underpins the spatiotemporal 
classification approach presented in this paper, where we 
use GPCs to automatically learn a linear combination of 
individual optodes that best predicts the class labels (see 
Supporting Information for details). 

An in-depth treatment of GP inference has been pre- 
sented elsewhere and we do not repeat the details here 
(see Marquand et al., 2010; Rasmussen and Williams, 
2006). In brief, GPC can be considered as a Bayesian exten- 
sion of logistic or probit regression, where a sigmoidal 
likelihood function models the probability of having 



observed each class label (conventionally denoted by 
p(y\x) where y € {—1,1} is the class label and x is a data 
vector). A latent regression function models relationships 
between the data points and inference proceeds by (i) 
placing a Gaussian prior over the latent function, (ii) find- 
ing the optimal covariance parameters for the data, (iii) 
computing the posterior function distribution and integrat- 
ing out the latent function to produce probabilistic predic- 
tions. Class predictions must sum to one, thus for binary 
classification it is sufficient to learn the latent function for 
a single class (i.e. p(y = l|x)). The predictions for the sec- 
ond class can then be derived by p(y = — l|x) = 
l-p(y = l|x). 

We used a customized version of the Gaussian processes 
for machine learning (GPML) toolbox for Matlab (The 
Mathworks, Natick, MA) (http://www.gaussianproces- 
s.org/gpml) for all GPC inference. We employed a similar 
approach to earlier work (Marquand et al., 2010) in that 
we used a probit likelihood function to model the class 
labels, and used the Expectation propagation algorithm 
(Minka, 2001; Rasmussen and Williams, 2006) to compute 
the posterior distribution for the latent function. For classi- 
fication of schizophrenic patients and controls (Experiment 
2), we applied ANOVA-based feature selection (Guyon 
and Elisseeff, 2003) to select the time points of maximum 
group differences between patients and controls in the 
training set using a nested (3-way) cross-validation proce- 
dure as implemented in the PROBID toolbox (www .brain- 
map. co.uk/probid), which ensures that the features were 
selected using the training set only. First the f-value and 
degrees of freedom were calculated for each time-point in 
the training set. Then the t-map was converted into a p- 
map, and time-points with a p-value higher than the 
threshold (uncorrected p = 0.05) were discarded. 

As noted, an important property of GP models is that 
they permit the flexible specification of prior covariance 
functions (i.e. kernels), which we utilized to construct the 
temporal and spatiotemporal classification models (see 
Supporting Information) Briefly, for the spatial classifier, 
we used a simple linear covariance function where data 
dimensions correspond to time points and the covariance 
function measures correlations between subjects across the 
whole time course. For the spatiotemporal classifier, we 
used the GPC to construct an optimal linear combination 
of covariance functions, each constructed from an individ- 
ual optode. To achieve this, we first estimated an inde- 
pendent linear covariance function for each optode 
derived from the whole time course (which models the 
temporal pattern) then learned a weighted combination of 
these to train the spatiotemporal classifier (which models 
the spatial pattern). Note that all model parameters were 
learned using the training set only. 

Leave-one-out cross-validation 

The leave-one-out cross-validation (LOO-CV) done in 
order to assess generalizability and to predict a sample's 



♦ 1106 ♦ 



♦ Probabilistic Biomarker-Based Classification ♦ 



probability of being in class 1 (e.g. to be a patient; which 
we denote by pep = p{y = l|x)) was conducted as follows: 
In each leave-one-out run, we used data from all but one 
sample per class (S-l of the S samples per class) to train 
the classifier. Subsequently, the pep of the remaining pair 
of samples (e.g. one time-segment from a patient and one 
from a control in one condition or the time-segment from 
the 97% contrast condition and one from the 8% condition 
from the same subject), which was so far unseen by the 
algorithm, was calculated. This procedure was repeated S 
times, each time leaving out a different pair of samples, 
yielding each sample's pep for each contrast. Thus, for 
each contrast, we obtained the probability that each sam- 
ple was labelled as a member of class 1 (e.g. a patient or a 
member of the 97% contrast checkerboard condition). This 
procedure was repeated for the data obtained at each one 
of the channels. Then, each sample's pep was thresholded 
at 0.50 to obtain binary class predictions. Accuracy was 
calculated as the ratio of correct predictions over number 
of cases. 

To establish whether the observed GP classification 
accuracies are statistically significant, we ran each GP clas- 
sifier 1000 times with randomly permuted labels and 
counted the number of permutations which achieved equal 
or greater accuracy than the one observed with the true 
labels. The p-value was then calculated by dividing this 
number by 1000. 

Finally, spatial accuracy maps were obtained by map- 
ping the accuracy calculated for each channel independ- 
ently and applying false discovery rate (FDR) correction 
for multiple comparisons. Accuracies which did not sur- 
vive FDR correction (P < 0.05 corrected) were set to 0 in 
the resulting maps to show only channels with significant 
accuracy. 

Accommodating Variable Class Priors 
Background 

In this section, we illustrate why it is essential to accom- 
modate variable class priors if a classifier is to be useful 
for predicting disease state or disease outcome in clinical 
practice. We adopt a didactic perspective and provide a 
simple synthetic example to illustrate, because accuracy 
measurements from diagnostic tests are surprisingly com- 
monly misinterpreted even amongst medical professionals 
(see Gigerenzer and Edwards, 2003). 

In practice, it can be difficult to train a classifier using a 
dataset with class frequencies that accurately reflect the 
prevalence of the disorder, particularly in neuroimaging, 
where data are expensive to acquire and therefore the 
total number of subjects available is usually small. For 
example, for a disease with prevalence of 5% a representa- 
tive training set will contain one patient for every 20 
controls. Thus it is likely that a classifier trained on such a 
dataset will obtain a strong bias toward detecting the 
larger class while not adequately characterizing the 



smaller class (since it has only seen a few examples). 
Instead, most neuroimaging studies employ artificially bal- 
anced training sets that contain approximately equal num- 
bers of each group with the view to applying this trained 
classifier to a different population. Further, neuroimaging 
studies most commonly report sensitivity, specificity and 
accuracy as the primary evaluation metrics. However, it 
should be emphasized that these measures alone do not 
completely describe the performance of the classifier. 
Instead, a more complete representation of classifier per- 
formance can be obtained by considering the entire confu- 
sion matrix that summarizes the errors made by the 
classifier on the test set (Fig. 2). 

As defined in Figure 2, the sensitivity of the classifier 
can be stated as the probability of obtaining a positive test 
result (i.e. classifier prediction) given that the disease is 
present. Similarly, the specificity is the probability of 
obtaining a negative test result given that the disease is 
absent. These are useful to determine if the classifier is 
performing above chance level, but for diagnosis it is the 
positive predictive value (PPV) and negative predictive 
value (NPV) that are more important. Put simply, we wish 
to know what the probability of having the disease is 
given we have a positive test result (i.e. the PPV) or the 
probability of not having the disease given we have a neg- 
ative result (NPV). Importantly, the PPV and NPV are 
both sensitive to the prior class frequencies. To see this, 
consider again training a classifier to diagnose a disease 
with prevalence of 5%. Assume that this classifier is 
trained on a perfectly balanced dataset as described above 
where it obtains a sensitivity and specificity of 80%, yield- 
ing a confusion matrix similar to that shown in Table I 
(where both PPV and NPV can also be seen to be 80%). If 
this classifier is then applied to diagnose 100 cases from 
the general population, approximately 95 of these can be 
expected to be controls and approximately five patients. 
From the sensitivity and specificity, we see that this classi- 
fier can be expected to diagnose 0.8 x 5 = 4 of the five 
patients correctly and 0.8 x 95 = 76 of the 95 controls cor- 
rectly. So far this seems reasonable, but given 80% specific- 
ity and the class priors noted above, the classifier can be 
expected to mistakenly predict that 20% of healthy control 
subjects have the disease, which amounts to 0.2 x 95 = 19 
subjects, as shown in the confusion matrix in Table II. In 
this case, the PPV of the test is much poorer and given a 
positive test result, the probability of having the disorder 
is only 4/(4 + 19) « 0.21. 

To summarize this section, it is important to consider 
the PPV and NPV in addition to the sensitivity and speci- 
ficity to describe the performance of a classifier. Further, 
the PPV and NPV from a classifier trained on an approxi- 
mately balanced dataset (e.g. derived from LOO-CV) do 
not reflect those of a population where the classes are 
unbalanced. In such a situation, the PPV on the unbal- 
anced population will be overestimated by the balanced 
classifier if the positive class is smaller (i.e. the type I error 
rate will be underestimated) and the NPV will be 
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Figure 2. 

Confusion matrix summarizing the errors made by the classifier on the test set. 



overestimated if the negative class is smaller (i.e. type II 
error rate will be overestimated). 

Adjusting probabilistic predictions to accommodate 
prior frequencies 

As noted, one of the main motivations for employing 
the GP classifiers proposed in this study is that they fur- 
nish probabilistic predictions, which unlike categorical 
predictions (for example, those provided by SVM) can be 
easily adjusted to compensate for different or variable 
class priors. In practice this can be achieved by a simple 
three-step procedure (Bishop, 2007): (i) dividing the pre- 
dictive probabilities for each class by the their frequency 
in the training set, (ii) multiplying them by their frequen- 
cies in the test set then and (iii) normalizing the class pre- 



dictions so that they sum to one. Intuitively this operation 
has the effect of making the classifier more likely to pre- 
dict the larger class by a factor determined by the relative 
class frequencies. The converse is also true in that the clas- 
sifier is only permitted to predict the smaller class when 
predictive confidence is high. 

Proof of concept demonstration 

Ideally, the approach outlined above should be evaluated 
using new data acquired directly from the target population, 
which unfortunately was not available for the present sam- 
ple. Instead, we demonstrate the approach by simulating an 
unbalanced test set by repeatedly sub-sampling the dataset 
described in section N-back Task (2-back condition only). 
This involved repeatedly performed the following steps: 



TABLE I. Performance of a hypothetical classifier TABLE II. Performance of a hypothetical classifier 

(balanced dataset) (unbalanced dataset) 









True class 






True class 






Patient 


Control 




Patient 


Control 


Predicted class 


Patient 


80 


20 Predicted class 


Patient 


4 


19 




Control 


20 


80 


Control 


1 


76 
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Figure 3. 

Spatial accuracy maps for classification of the 97%-contrast (left panel), the 40%-contrast (middle 
panel), and the 8%-contrast condition (right panel) versus the no-contrast condition (P < 0.05, 
FDR-corrected). Highest accuracies are consistently found over the primary visual cortex. 



Randomly split the data in half, each half containing an 
equal number of patients and controls (and therefore 
the same total number of subjects). 

Train a balanced classifier to discriminate patients from 
controls based on one half of the data (training set) 
Perform an inner CV loop on the remaining data (vali- 
dation set): 

1. Draw a test set from the validation set which con- 
tains more patients than controls by factor F 

2. Apply classifier trained in step 2 to the unbalanced 
test set 

3. Compute predictions and performance metrics 

4. Adjust predictions for class frequencies and recom- 
pute performance metrics 

Repeat steps 3a -d until each patient has been sampled 



We employed the above procedure, so that the test data- 
set was unbalanced by a factor of F = 5, 10 or 20 (i.e. hav- 
ing 5, 10 or 20 controls for every patient which 
corresponds to disease prevalence of 20, 10 and 5%, 
respectively) and report the average performance over 10 
random splits of the data for each level of this factor. For 
comparison, we also computed the predictive performance 
on the entire validation set at once which shows the accu- 
racy that would have been obtained if the same subjects 
were tested using a balanced design (i.e. split-half CV), 
For each classification approach, we report the sensitivity, 
specificity, PPV, NPV, accuracy, balanced accuracy (Bro- 
dersen et al., 2010) and the overall predictive value (OPV), 
which we define as the mean of PPV and NPV (Fig. 2). 
The accuracy describes the proportion of correct predic- 
tions overall and balanced accuracy describes the mean 
proportion of correct predictions for each class (thereby ac- 
commodating the different class frequencies). If the class 
frequencies in the test set are equal (as for the LOO-CV 
approach described in section Leave-one-out cross-valida- 
tion), the accuracy and balanced accuracy are equivalent. 

It is unreasonable to assume that the chance levels for a 
classifier applied to an unbalanced test dataset are equal 
to 50%, so we employed permutation testing to (i) derive 
empirical chance level for each of the classifier perform- 
ance measures noted above and (ii) to assess whether the 
classifiers exceeded chance for the balanced accuracy and 



OPV (the two most appropriate summary measures for 
unbalanced datasets). We achieved this by retraining each 
classifier 1000 times with training labels randomly 
assigned to either of the two groups. This provided a dis- 
tribution of accuracies under the null hypothesis that the 
classifier did not exceed chance. We report the mean of 
this distribution as the empirical chance level for each per- 
formance measure and we assessed significance by (i) 
counting the number of times the permuted balanced ac- 
curacy or OPV was greater than or equal to their true val- 
ues and (ii) dividing each of these numbers by 1000. 



RESULTS 

Visual Checkerboard Paradigm 

Independent GP classification based on the temporal in- 
formation from each of the 52 fNIRS channels revealed 
significant accuracies for the classification of all three con- 
trast conditions versus the no-contrast condition (Fig. 3). 
As hypothesized, the maximum classification accuracy 
was found at channels located over the primary visual cor- 
tex (channels 26, 25 and 15 for high-, medium- and low- 
contrast intensity, respectively) as previously identified 
using the functional localizer (see section Task description 
and procedures). Likewise, a Wilcoxon signed rank test 
revealed that the mean accuracy of all channels over the 
visual cortex region of interest (ROI) was higher than the 
mean accuracy of the channels outside this ROI. This effect 
was present in all three contrast conditions (97% vs. 0% 
contrast: Z = 2.54; P < 0.011; 40% vs. 0% contrast: Z = 
2.52; P < 0.012; 8% vs. 0% contrast: Z = 2.38; P < 0.018). 
The median accuracy of all GP classifiers based on chan- 
nels over the visual cortex was at 93%, 91 and 84% for 
high-, medium- and low-contrast intensity, respectively. 



Classification in the Context of Schizophrenia 

Using the temporal information from the two probe sets 
each comprising the 22 fNIRS channels for both condi- 
tions, revealed significant accuracies for the classification 
of schizophrenic patients versus healthy controls. Signifi- 
cant classification accuracy was found for the 2-back 
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Figure 4. 

(a) Schematic arrangement of the fNIRS probe set (red squares: emitter; blue squares: detectors; 
numbers: measurement channels), (b) Spatial accuracy maps for the classification of schizo- 
phrenic patients and healthy controls (2-back condition; P < 0.05, FDR-corrected). 



condition at channels located over inferior fronto-temporal 
areas, extending into the dorsolateral prefrontal cortex 
showing accuracies as high as 78% (channels 2, 3, 4, 7, 8, 
9, 12, 17, 20, 22 in the left probe set and channels 1, 2, 3, 4, 
6, 7, 8, 9, 13, 18 in the right probe set; Fig. 4). 

In the 1-back condition, a number of channels also 
showed significant accuracy (channels 14, 19, 22 in the left 
probe set and channels 1, 2, 17, 18, 21 in the right probe set). 

Integrating spatial and temporal information using the 
spatiotemporal classification approach leads to an accuracy 
of 76% (P < 0.001; sensitivity = 80%; specificity = 72.5%, 
PPV = 73.8%, NPV = 76.3%) for the 2-back condition and 
to an accuracy of 67.5% (P < 0.001; sensitivity = 70%; 



specificity = 65%, PPV 
back condition. 



66.7%, NPV = 68.4%) for the 1- 



Accommodating Variable Class Priors 

Classifier performance measures derived from the artifi- 
cially unbalanced datasets are summarized in Tables III-V. 
The split-half accuracy was overall slightly lower than the 
LOO-CV accuracy, probably because of the smaller train- 
ing set split half CV entails. As expected, NPV increases 
monotonically and PPV decreases as the test set becomes 
more unbalanced with PPV becoming very poor for the 



TABLE III. Classifier performance for a balanced training set and unbalanced test set (five patients for every 

control) 





Sensitivity 


Specificity 


PPV 


NPV 


Accuracy 


Bal. Acc. 


OPV 


Split half 


73.0 (1.0) 


69.0 (0.7) 


70.4 (0.4) 


72.4 (0.8) 


71.0 (0.4) 


71.0 (0.4) 


71.4 (0.4) 


Not adjusted 


71.5 (1.1) 


73.1 (0.9) 


38.1 (1.1) 


87.6 (2.0) 


73.2 (0.7) 


72.3 (0.5) 


62.8 (1.3) 


Adjusted 


58.5 (0.6) 


87.1 (0.7) 


55.3 (1.6) 


91.3 (0.1) 


82.3 (0.5) 


72.8 (0.3) 


73.3 (0.8) 


Chance: n. adj. 


49.9 


50.0 


16.6 


83.3 


50.0 


49.9 


58.3 


Chance: adj. 


24.4 


75.4 


16.6 


83.3 


66.9 


49.9 


58.3 



All statistics are reported as percentages and numbers in brackets indicate standard errors across 10 random splits of the data. Chance 
levels on the unbalanced test set for the adjusted and non-adjusted classifiers are also reported. PPV = positive predictive value, NPV 
= negative predictive value, OPV = overall predictive value. 
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TABLE IV. Classifier performance for a balanced training set and unbalanced test set (ten patients for every 

control) 

Sensitivity Specificity PPV NPV Accuracy Bal. Acc. OPV 



Split half 


74.0 (1.0) 


67.5 


Not adjusted 


74.0 (1.0) 


68.3 


Adjusted 


51.5 (0.7) 


88.5 


Chance: n. adj. 


50.1 


50 


Chance: adj. 


16.2 


83 



(0.9) 


69.7 (0.4) 


72.9 (0.9) 


(0.9) 


20.1 (0.6) 


96.2 (0.2) 


(0.7) 


40.7 (2.5) 


89.4 (1.4) 


.0 


9.1 


90.1 


.5 


9.0 


90.1 



70.8 (0.4) 70.8 (0.4) 71.0 (0.5) 

68.8 (0.8) 71.1 (0.5) 58.1 (0.3) 

85.4 (0.6) 70.0 (0.5) 65.0 (1.6) 
50.0 50.0 54.6 
77.4 49.9 54.4 



All statistics are reported as percentages and numbers in brackets indicate standard errors across ten random splits of the data. Chance 
levels on the unbalanced test set for the adjusted and non-adjusted classifiers are also reported. PPV = positive predictive value, NPV 
= negative predictive value, OPV = overall predictive value. 



most unbalanced dataset. Further, the PPV derived from 
the (balanced) split half CV approach becomes increas- 
ingly unrepresentative of the performance of the classifier 
for the unbalanced data. Crucially, when the predictions 
are adjusted to compensate for class priors in the test set, 
PPV improves in every case and the NPV of the adjusted 
classifier remains approximately equivalent to the non- 
adjusted classifier. The accuracy from the classifier also 
increases substantially after the predictive probabilities 
have been adjusted, indicating an increase in the total 
number of correct predictions. However, this is not of in- 
terest from a diagnostic perspective since it mainly reflects 
the imbalance in the test set given that the adjusted classi- 
fier is more likely to correctly predict membership of the 
larger class. In contrast, the balanced accuracy remains vir- 
tually identical for all classifiers, indicating that the aver- 
age proportion of correct predictions for each class 
remained constant before and after the adjustment of the 
predictive probabilities. At first glance, the sensitivity of 
the adjusted classifier appears relatively poor but, it 
should be emphasized that the sensitivity reflects the pro- 
portion of correct predictions the classifier obtained 
against the weight of evidence that most cases are 
expected to be derived from the larger class. To emphasize 
this point, comparison with the empirical chance levels 
(Table III) shows that the sensitivity and specificity of the 
adjusted classifier exceeded chance in every case. 

The permutation test for the balanced accuracy on the 
unbalanced test datasets showed that both the adjusted 
and the non-adjusted classifiers exceeded chance level for 



F = 5, 10 and 20. Further, the permuted balanced accuracy 
never exceeded its true value, corresponding to a signifi- 
cance of P = 0.001 for each classifier. The permutation test 
for the OPV showed that in all cases the adjusted classifier 
produced greater OPV than would be expected by chance 
(P = 0.002, P = 0.006 and P = 0.001 respectively for F = 5, 
10 and 20). In contrast, the OPV for the non-adjusted clas- 
sifier did not exceed chance for any level of F (P = 0.08, 
P = 0.07 and P = 0.36, respectively), which as expected 
reflects the poor performance of the non-adjusted classifier 
in an unbalanced test setting. 



DISCUSSION 

Using the newly developed temporal pattern classifier 
for fNIRS, we showed that temporal information is suffi- 
cient to differentiate between conditions in a visual check- 
erboard task. Further validating the approach, high 
accuracies were specifically found at channels over the pri- 
mary visual cortex while classification accuracy was lower 
in other regions. Most importantly, GP classification based 
solely on the temporal dynamics during an n-back task 
allowed for significant single-subject classification of schiz- 
ophrenic patients and healthy controls. Furthermore, the 
maximum accuracy of 78% and the combined accuracy of 
76% obtained from spatial integration are in the range of 
the highest ever achieved in biomarker-based classification 
of schizophrenic patients. In addition, our accuracy estima- 
tion is based on one of the largest schizophrenic samples 



TABLE V. Classifier performance for a balanced training set and unbalanced test set (20 patients for every control) 





Sensitivity 


Specificity 


PPV 


NPV 


Accuracy 


Bal. Acc. 


OPV 


Split half 


71.0 (0.7) 


68.5 (1.0) 


70.9 (0.5) 


70.4 (0.5) 


69.8 (0.5) 


69.8 (0.5) 


70.7 (0.4) 


Not adjusted 


71.0 (0.7) 


69.5 (0.9) 


11.0 (0.2) 


94.9 (1.1) 


69.1 (0.8) 


70.3 (0.4) 


53.0 (0.6) 


Adjusted 


48.5 (1.1) 


90.0 (0.7) 


29.3 (2.8) 


97.2 (0.0) 


89.8 (0.4) 


69.3 (0.3) 


63.3 (1.4) 


Chance: n. adj. 


50.4 


50.0 


4.8 


95.3 


50.0 


50.1 


52.4 


Chance: adj. 


14.8 


85.2 


4.8 


95.2 


81.9 


50.0 


52.4 



All statistics are reported as percentages and numbers in brackets indicate standard errors across 10 random splits of the data. Chance 
levels on the unbalanced test set for the adjusted and non-adjusted classifiers are also reported. PPV = positive predictive value, NPV 
= negative predictive value, OPV = overall predictive value. 
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ever considered in functional biomarker research of this 
disorder (for a review and methodological considerations 
regarding previous studies, see Demirci et al., 2008). 
Finally, we show that the predictions derived from our 
probabilistic classification approach can be easily adjusted 
to compensate for variable class priors, which is highly ad- 
vantageous if it is necessary to make predictions in differ- 
ent populations from those with which the classifier was 
trained. 

Specifically considering the results of the visual checker- 
board experiment, the algorithm appears to be sensitive to 
task-related differences in the temporal dynamics of the 
BOLD response. As expected, accuracies were higher over 
the visual cortex than over other regions. While these 
results demonstrate sufficient sensitivity and specificity of 
the algorithm, a number of questions arise: Generally, the 
effects observed in the visual checkerboard experiment are 
relatively large and regularly observable in the time-series 
of single trials with the naked eye. It remains to be investi- 
gated how reliability impacts classification. This is particu- 
larly important as pattern recognition requires a more or 
less reliable pattern in single samples (i.e. for every subject 
and condition). Generally, most classifiers including the 
one used in our approach only assume that the training 
samples are independent and identically distributed. 

Additionally, it might be of interest which properties of 
the time-series drive classification. In analogy to the spatial 
weight-mapping method described by Marquand et al. 
(2010), temporal weight-mapping could quantify the contri- 
bution of each time-point to the formation of the decision 
boundary. When the distribution of the samples with 
respect to the decision boundary is of interest, temporal 
g-maps in analogy to the spatial g-maps described in Mar- 
quand et al. (2010) could also be of interest (an example of 
each temporal map can be found in Supporting Informa- 
tion, Fig. 1). One should, however, be aware that the maps 
represent a multivariate temporal pattern, and one ought to 
be cautious about interpreting time-points or time-periods 
in isolation. Coefficient scores (weights) for each time-point 
should thus be interpreted in the context of the entire multi- 
variate pattern. Nonetheless, if hypotheses about the tem- 
poral pattern exist, discrimination mapping might prove 
helpful in visualizing what the classifier has learned. 

As we perform GP classification at each fNIRS channel 
independently, it also needs to be noted that the same 
accuracies might result from different temporal patterns. 
In standard GLM group analyses, a similar phenomenon 
is evident, for instance, when one or more derivatives are 
included in the model. In such cases, the same beta-value 
might be estimated even though the latency or the disper- 
sion of the subjects' hemodynamic response can differ. 

Differentiating between schizophrenic patients and 
healthy controls, the temporal pattern recognition algo- 
rithm demonstrates its suitability for predicting group 
membership for individual subjects. Further validating the 
approach, it yields above chance accuracies in those 
regions which have previously been associated with the 



n-back task in group analyses comparing schizophrenic 
patients and controls (e.g. Reif et al, 2011). In particular, 
temporal information in the inferior fronto-temporal areas 
is relevant for classification. As other approaches using for 
instance large anatomical (Davatzikos et al., 2005; Job 
et al, 2006; Kawasaki et al., 2007; Nakamura et al., 2004; 
Takayanagi et al., 2010) or functional MRI data sets (Cal- 
houn et al., 2008; Pokrajac et al., 2005; Shen et al., 2010) do 
not substantially outperform our approach which is based 
on a comparatively small amount of data per subject, 
using temporal information in biomarker-based classifica- 
tion appears highly promising. 

Notwithstanding the need for further investigation of 
the properties of our approach using different experimen- 
tal designs (i.e. segment lengths, particularly in rapid 
event-related designs), the advantages of temporal pattern 
classification appear substantial: From a methodological 
point of view, the approach can be considered model-free 
in the sense that no prior assumptions about parameters 
of the hemodynamic response (such as shape or latency) 
are required. The GP classifier merely learns to differenti- 
ate temporal patterns which can then be applied to predict 
class memberships based on new data. In contrast to ordi- 
nary regression analysis, temporal (auto-) correlations 
within the data are therefore not only unproblematic in 
GP classification, but temporal structure (i.e. regular pat- 
terns or oscillations) can be learned and utilized to 
increase classification accuracy. In fact, performing GP 
classification on the Fourier transformed (frequency) data 
is equivalent to conducting the analysis on the original 
data. In this context, future studies could furthermore use 
specific priors to test hypotheses concerning the discrimi- 
native power of specific parts of the frequency spectrum. 

A key feature of the probabilistic approach to classifica- 
tion is that it aims to be coherent about managing uncer- 
tainty throughout the analysis, which ultimately provides 
predictions that accurately reflect the confidence of the 
classifier decision. We demonstrated in this work that a 
simple numerical adjustment of these probabilistic predic- 
tions enables the classifier to easily compensate for vari- 
able class priors. For the proof-of-concept application we 
presented here, the adjusted classifier showed significant 
predictive value even in the most unbalanced test set. Fur- 
ther, without this adjustment the classifier was unable to 
produce above chance predictive value in any of the 
unbalanced test settings, despite achieving impressive sen- 
sitivity and specificity and a balanced accuracy signifi- 
cantly above chance. This approach is potentially 
beneficial for improving the predictive value (i.e. limiting 
the error rate) of the classifier in a wide range of clinical 
neuroimaging contexts where the class priors may be vari- 
able (e.g. predicting relapse, remission or treatment 
response in addition to diagnosis). Other advantages of 
this approach are that it does not require the class fre- 
quencies in the test set to be specified during training, and 
there is no need to retrain the classifier if the class fre- 
quencies in the test set change. Note that this adjustment 
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cannot be performed if the classifier does not provide 
probabilistic predictions even if the classifier provides a 
numerical decision function that serves as a surrogate for 
classifier confidence (since the decision function may be 
scaled arbitrarily). There is also no reason why this 
approach is limited to GPC and it is suitable for any classi- 
fier that provides probabilistic predictions. However we 
favour GPC in practice because in previous work, we 
found that it produced more accurate predictions for fMRI 
data than alternative probabilistic classifiers such as the 
relevance vector machine (Marquand et al v 2010). In addi- 
tion, the capability of GPC to automatically estimate pa- 
rameters from the data is very useful in a wide range of 
neuroimaging contexts and underpins the spatiotemporal 
classification approach proposed in this paper. 

In addition, the temporal classifier can be used to obtain 
spatial maps showing where in space temporal dynamics 
are most informative. As the temporal classifiers are run in- 
dependently at each spatial location of interest, the resulting 
spatial maps (i.e. the accuracy maps) are not multivariate in 
nature and can thus be interpreted more easily than those 
commonly obtained using spatial or spatial-temporal pat- 
tern recognition approaches (Mourao-Miranda et al., 2007). 
While more computationally demanding, our approach 
thereby addresses a fundamental issue regarding the inter- 
pretation of spatial maps obtained from other biomarker- 
based machine learning approaches to neuroimaging. 

From the point of view of practical application, the algo- 
rithm in combination with fNIRS measurements enables 
the analysis of sub-second, multivariate temporal patterns 
within the BOLD response and thus allows for an investi- 
gation of potentially clinically relevant temporal dynamics 
differentiating patients and healthy individuals or task 
related mental states within patient populations. As our 
approach provides the classification probability for each 
participant, knowledge about prevalence of a disorder can 
easily be incorporated. Considering the ecological validity 
of fNIRS (absence of (1) horizontal position, (2) head-fixa- 
tion, (3) anxiety inducing environment in a narrow scan- 
ner, (4) loud measurement sequences) combined with the 
low sensitivity for movement artefacts and the ease and 
low cost with which fNIRS can be used with virtually any 
patient group "at bedside", biomarker-based classification 
as described here might be of particular importance for 
practical applications in the future. Most importantly, 
these practical applications in the field of psychiatry can 
not only consist in the differentiation between patients 
with a certain disorder from healthy controls, but in the 
prediction of the success of a specific therapy in an indi- 
vidual patient. 
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